In this project I analyze anxiety in gamers. I use several visualizations and try to classify whether a gamer has a concerning anxiety level based on their characteristics.
I combine the GAD (Generalized Anxiety Disorder), SWL (Satisfaction With Life) and SPIN (Social Phobia Inventory) scores to build the target variable.
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.figure_factory as ff
from kmodes.kprototypes import KPrototypes
from sklearn import metrics
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostClassifier, GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.svm import SVC

warnings.filterwarnings('ignore')
df = pd.read_csv("GamingStudy_data.csv", encoding= 'unicode_escape')
df.head()
| | S. No. | Timestamp | GAD1 | GAD2 | GAD3 | GAD4 | GAD5 | GAD6 | GAD7 | GADE | ... | Birthplace | Residence | Reference | Playstyle | accept | GAD_T | SWL_T | SPIN_T | Residence_ISO3 | Birthplace_ISO3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 42052.00437 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | Not difficult at all | ... | USA | USA | Reddit | Singleplayer | Accept | 1 | 23 | 5.0 | USA | USA |
| 1 | 2 | 42052.00680 | 1 | 2 | 2 | 2 | 0 | 1 | 0 | Somewhat difficult | ... | USA | USA | Reddit | Multiplayer - online - with strangers | Accept | 8 | 16 | 33.0 | USA | USA |
| 2 | 3 | 42052.03860 | 0 | 2 | 2 | 0 | 0 | 3 | 1 | Not difficult at all | ... | Germany | Germany | Reddit | Singleplayer | Accept | 8 | 17 | 31.0 | DEU | DEU |
| 3 | 4 | 42052.06804 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | Not difficult at all | ... | USA | USA | Reddit | Multiplayer - online - with online acquaintanc... | Accept | 0 | 17 | 11.0 | USA | USA |
| 4 | 5 | 42052.08948 | 2 | 1 | 2 | 2 | 2 | 3 | 2 | Very difficult | ... | USA | South Korea | Reddit | Multiplayer - online - with strangers | Accept | 14 | 14 | 13.0 | KOR | USA |
5 rows × 55 columns
df.columns
Index(['S. No.', 'Timestamp', 'GAD1', 'GAD2', 'GAD3', 'GAD4', 'GAD5', 'GAD6',
'GAD7', 'GADE', 'SWL1', 'SWL2', 'SWL3', 'SWL4', 'SWL5', 'Game',
'Platform', 'Hours', 'earnings', 'whyplay', 'League', 'highestleague',
'streams', 'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4', 'SPIN5', 'SPIN6',
'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 'SPIN11', 'SPIN12', 'SPIN13',
'SPIN14', 'SPIN15', 'SPIN16', 'SPIN17', 'Narcissism', 'Gender', 'Age',
'Work', 'Degree', 'Birthplace', 'Residence', 'Reference', 'Playstyle',
'accept', 'GAD_T', 'SWL_T', 'SPIN_T', 'Residence_ISO3',
'Birthplace_ISO3'],
dtype='object')
df = df.drop(columns=['S. No.', 'Timestamp', 'League', 'highestleague', 'Narcissism', 'Birthplace', 'Residence',
'accept', 'Birthplace_ISO3', 'GAD1', 'GAD2', 'GAD3', 'GAD4', 'GAD5',
'GAD6', 'GAD7', 'SWL1', 'SWL2', 'SWL3', 'SWL4', 'SWL5', 'SPIN1', 'SPIN2', 'SPIN3', 'SPIN4',
'SPIN5', 'SPIN6', 'SPIN7', 'SPIN8', 'SPIN9', 'SPIN10', 'SPIN11', 'SPIN12', 'SPIN13',
'SPIN14', 'SPIN15', 'SPIN16', 'SPIN17'])
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | GAD_T | SWL_T | SPIN_T | Residence_ISO3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | I play for fun | having fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor (or equivalent) | Reddit | Singleplayer | 1 | 23 | 5.0 | USA |
| 1 | Somewhat difficult | Other | PC | 8.0 | I play for fun | having fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor (or equivalent) | Reddit | Multiplayer - online - with strangers | 8 | 16 | 33.0 | USA |
| 2 | Not difficult at all | Other | PC | 0.0 | I play for fun | having fun | 0.0 | Female | 32 | Employed | Bachelor (or equivalent) | Reddit | Singleplayer | 8 | 17 | 31.0 | DEU |
| 3 | Not difficult at all | Other | PC | 20.0 | I play for fun | improving | 5.0 | Male | 28 | Employed | Bachelor (or equivalent) | Reddit | Multiplayer - online - with online acquaintanc... | 0 | 17 | 11.0 | USA |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | I play for fun | having fun | 1.0 | Male | 19 | Employed | High school diploma (or equivalent) | Reddit | Multiplayer - online - with strangers | 14 | 14 | 13.0 | KOR |
I manually consolidated each free-text column below after reviewing all of its distinct responses.
# Collapse the verbose console label ("Console (PS, Xbox, ...)") to "Console".
# str.rstrip strips a character set rather than a suffix, so a case-insensitive
# test plus a clean replacement label is used instead.
df["Platform"] = df["Platform"].map(lambda x: "Console" if "console" in x.lower() else x)
df["Playstyle"] = df["Playstyle"].map(lambda x: "Multiplayer" if any([y in x.lower() for y in ["multiplayer", "online", "friend", "stranger", "internet", "match"]])
else "SinglePlayer" if any([y in x.lower() for y in ["single", "alone", "solo", "one"]])
else "Both" if any([y in x.lower() for y in ["all", "both", "everything", "mix", "5"]]) else "Other")
df["earnings"] = df["earnings"].map(lambda x: 1 if any([y in x.lower() for y in ["earn", "money", "both", "pay", "paid", "living", "$", "pro", "career", "job", "tourn"]]) else 0)
df["whyplay"] = df["whyplay"].map(lambda x: "all" if all([y in x.lower() for y in ["fun", "improv", "relax", "win"]]) or
any([y in x.lower() for y in ["all", "a b c", "4", "any", "everything"]])
else "improve and relax" if all([y in x.lower() for y in ["improv", "relax"]])
else "improve and win" if all([y in x.lower() for y in ["improv", "win"]]) or
any([y in x.lower() for y in ["goal"]])
else "relax and win" if all([y in x.lower() for y in ["relax", "win"]])
else "fun and relax" if all([y in x.lower() for y in ["fun", "relax"]]) or
any([y in x.lower() for y in ["friend", "passing"]])
else "fun and improve" if all([y in x.lower() for y in ["fun", "improv"]])
else "fun and winning" if all([y in x.lower() for y in ["fun", "winning"]]) or
any([y in x.lower() for y in ["loot"]])
                                    else "fun" if "fun" in x.lower() or
                                        any([y in x.lower() for y in ["bored", "socializ"]])
                                    else "improve" if "improv" in x.lower()
                                    else "relax" if any([y in x.lower() for y in ["relax", "stress", "forget", "depress", "distract", "wast", "escap", "problem"]])
                                    else "win" if "win" in x.lower() else "Other")
# A single response never contains all three doctorate labels, so membership is
# tested with any(); the raw value is "High school diploma (or equivalent)",
# so that check is case-insensitive.
df["Degree"] = df["Degree"].map(lambda x: "Bachelor" if "Bachelor" in x
                                else "High School" if "high school" in x.lower()
                                else "Doctorate" if any([y in x for y in ["Ph.D.", "Psy. D.", "MD"]])
                                else "Master" if "Master" in x else "Other")
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | GAD_T | SWL_T | SPIN_T | Residence_ISO3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | 0 | fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor | Reddit | SinglePlayer | 1 | 23 | 5.0 | USA |
| 1 | Somewhat difficult | Other | PC | 8.0 | 0 | fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor | Reddit | Multiplayer | 8 | 16 | 33.0 | USA |
| 2 | Not difficult at all | Other | PC | 0.0 | 0 | fun | 0.0 | Female | 32 | Employed | Bachelor | Reddit | SinglePlayer | 8 | 17 | 31.0 | DEU |
| 3 | Not difficult at all | Other | PC | 20.0 | 0 | improve | 5.0 | Male | 28 | Employed | Bachelor | Reddit | Multiplayer | 0 | 17 | 11.0 | USA |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | 0 | fun | 1.0 | Male | 19 | Employed | Other | Reddit | Multiplayer | 14 | 14 | 13.0 | KOR |
df.describe()
| | Hours | earnings | streams | Age | GAD_T | SWL_T | SPIN_T |
|---|---|---|---|---|---|---|---|
| count | 13434.000000 | 13464.000000 | 13364.000000 | 13464.000000 | 13464.000000 | 13464.000000 | 12814.000000 |
| mean | 22.247357 | 0.087938 | 11.233538 | 20.930407 | 5.211973 | 19.788844 | 19.848525 |
| std | 70.284502 | 0.283216 | 78.549209 | 3.300897 | 4.713267 | 7.229243 | 13.467493 |
| min | 0.000000 | 0.000000 | 0.000000 | 18.000000 | 0.000000 | 5.000000 | 0.000000 |
| 25% | 12.000000 | 0.000000 | 4.000000 | 18.000000 | 2.000000 | 14.000000 | 9.000000 |
| 50% | 20.000000 | 0.000000 | 8.000000 | 20.000000 | 4.000000 | 20.000000 | 17.000000 |
| 75% | 28.000000 | 0.000000 | 15.000000 | 22.000000 | 8.000000 | 26.000000 | 28.000000 |
| max | 8000.000000 | 1.000000 | 9001.000000 | 63.000000 | 21.000000 | 35.000000 | 68.000000 |
df.isna().sum()
GADE               649
Game                 0
Platform             0
Hours               30
earnings             0
whyplay              0
streams            100
Gender               0
Age                  0
Work                38
Degree               0
Reference           15
Playstyle            0
GAD_T                0
SWL_T                0
SPIN_T             650
Residence_ISO3     110
dtype: int64
# Impute the missing SPIN totals with the column mean,
# then drop the few rows with missing values in any other column
df['SPIN_T'] = df['SPIN_T'].fillna(df['SPIN_T'].mean())
df.dropna(subset=['GADE', 'Hours', 'streams', 'Work', 'Residence_ISO3', 'Reference'], inplace=True)
df.isna().sum()
GADE              0
Game              0
Platform          0
Hours             0
earnings          0
whyplay           0
streams           0
Gender            0
Age               0
Work              0
Degree            0
Reference         0
Playstyle         0
GAD_T             0
SWL_T             0
SPIN_T            0
Residence_ISO3    0
dtype: int64
# Drop implausible weekly totals (e.g. the 8000-hour and 9001-stream outliers above)
df.drop(df[df.Hours >= 120].index, inplace=True)
df.drop(df[df.streams >= 120].index, inplace=True)
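A fixed cutoff of 120 is one choice; an alternative (illustrative only, not what this notebook uses) is a quantile-based filter, which adapts to the data instead of hard-coding a threshold:

```python
import pandas as pd

# Toy weekly-hours values containing one extreme outlier
s = pd.Series([10, 20, 25, 30, 8000])

# Keep rows at or below the 99th percentile instead of a fixed threshold
cap = s.quantile(0.99)
filtered = s[s <= cap]
print(filtered.tolist())  # [10, 20, 25, 30]
```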
df['Residence_ISO3']= LabelEncoder().fit_transform(df['Residence_ISO3'])
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | GAD_T | SWL_T | SPIN_T | Residence_ISO3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | 0 | fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor | Reddit | SinglePlayer | 1 | 23 | 5.0 | 102 |
| 1 | Somewhat difficult | Other | PC | 8.0 | 0 | fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor | Reddit | Multiplayer | 8 | 16 | 33.0 | 102 |
| 2 | Not difficult at all | Other | PC | 0.0 | 0 | fun | 0.0 | Female | 32 | Employed | Bachelor | Reddit | SinglePlayer | 8 | 17 | 31.0 | 23 |
| 3 | Not difficult at all | Other | PC | 20.0 | 0 | improve | 5.0 | Male | 28 | Employed | Bachelor | Reddit | Multiplayer | 0 | 17 | 11.0 | 102 |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | 0 | fun | 1.0 | Male | 19 | Employed | Other | Reddit | Multiplayer | 14 | 14 | 13.0 | 56 |
conditions = [
(df['GAD_T'] <= 4),
(df['GAD_T'] >= 5) & (df['GAD_T'] <= 9),
(df['GAD_T'] >= 10) & (df['GAD_T'] <= 14),
(df['GAD_T'] >= 15)
]
values = ['minimal', 'mild', 'moderate', 'severe']
df['GAD'] = np.select(conditions, values)
df = df.drop(["GAD_T"], axis=1)
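The same four GAD severity bands can also be written more compactly with `pd.cut`; this is just a sketch equivalent to the `np.select` mapping above (the notebook keeps the `np.select` version):

```python
import pandas as pd

# GAD-7 severity bands; bins are right-inclusive:
# (-1, 4] -> minimal, (4, 9] -> mild, (9, 14] -> moderate, (14, 21] -> severe
scores = pd.Series([0, 4, 5, 9, 10, 14, 15, 21])
labels = pd.cut(scores, bins=[-1, 4, 9, 14, 21],
                labels=['minimal', 'mild', 'moderate', 'severe'])
print(list(labels))
```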
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | SWL_T | SPIN_T | Residence_ISO3 | GAD |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | 0 | fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor | Reddit | SinglePlayer | 23 | 5.0 | 102 | minimal |
| 1 | Somewhat difficult | Other | PC | 8.0 | 0 | fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor | Reddit | Multiplayer | 16 | 33.0 | 102 | mild |
| 2 | Not difficult at all | Other | PC | 0.0 | 0 | fun | 0.0 | Female | 32 | Employed | Bachelor | Reddit | SinglePlayer | 17 | 31.0 | 23 | mild |
| 3 | Not difficult at all | Other | PC | 20.0 | 0 | improve | 5.0 | Male | 28 | Employed | Bachelor | Reddit | Multiplayer | 17 | 11.0 | 102 | minimal |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | 0 | fun | 1.0 | Male | 19 | Employed | Other | Reddit | Multiplayer | 14 | 13.0 | 56 | moderate |
conditions = [
(df['SWL_T'] <= 9),
(df['SWL_T'] >= 10) & (df['SWL_T'] <= 14),
(df['SWL_T'] >= 15) & (df['SWL_T'] <= 19),
(df['SWL_T'] == 20),
(df['SWL_T'] >= 21) & (df['SWL_T'] <= 25),
(df['SWL_T'] >= 26) & (df['SWL_T'] <= 30),
(df['SWL_T'] >= 31) & (df['SWL_T'] <= 35)
]
values = ['extremely dissatisfied', 'dissatisfied', 'slightly dissatisfied', 'neutral', 'slightly satisfied', 'satisfied', 'extremely satisfied']
df['SWL'] = np.select(conditions, values)
df = df.drop(["SWL_T"], axis=1)
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | SPIN_T | Residence_ISO3 | GAD | SWL |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | 0 | fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor | Reddit | SinglePlayer | 5.0 | 102 | minimal | slightly satisfied |
| 1 | Somewhat difficult | Other | PC | 8.0 | 0 | fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor | Reddit | Multiplayer | 33.0 | 102 | mild | slightly dissatisfied |
| 2 | Not difficult at all | Other | PC | 0.0 | 0 | fun | 0.0 | Female | 32 | Employed | Bachelor | Reddit | SinglePlayer | 31.0 | 23 | mild | slightly dissatisfied |
| 3 | Not difficult at all | Other | PC | 20.0 | 0 | improve | 5.0 | Male | 28 | Employed | Bachelor | Reddit | Multiplayer | 11.0 | 102 | minimal | slightly dissatisfied |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | 0 | fun | 1.0 | Male | 19 | Employed | Other | Reddit | Multiplayer | 13.0 | 56 | moderate | dissatisfied |
conditions = [
(df['SPIN_T'] <= 20),
(df['SPIN_T'] >= 21) & (df['SPIN_T'] <= 30),
(df['SPIN_T'] >= 31) & (df['SPIN_T'] <= 40),
(df['SPIN_T'] >= 41) & (df['SPIN_T'] <= 50),
(df['SPIN_T'] >= 51)
]
values = ['minimal', 'mild', 'moderate', 'severe', 'extreme']
df['SPIN'] = np.select(conditions, values)
df = df.drop(["SPIN_T"], axis=1)
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | Residence_ISO3 | GAD | SWL | SPIN |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | 0 | fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor | Reddit | SinglePlayer | 102 | minimal | slightly satisfied | minimal |
| 1 | Somewhat difficult | Other | PC | 8.0 | 0 | fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor | Reddit | Multiplayer | 102 | mild | slightly dissatisfied | moderate |
| 2 | Not difficult at all | Other | PC | 0.0 | 0 | fun | 0.0 | Female | 32 | Employed | Bachelor | Reddit | SinglePlayer | 23 | mild | slightly dissatisfied | moderate |
| 3 | Not difficult at all | Other | PC | 20.0 | 0 | improve | 5.0 | Male | 28 | Employed | Bachelor | Reddit | Multiplayer | 102 | minimal | slightly dissatisfied | minimal |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | 0 | fun | 1.0 | Male | 19 | Employed | Other | Reddit | Multiplayer | 56 | moderate | dissatisfied | minimal |
corr = df.corr()
z = np.array(corr)
fig = ff.create_annotated_heatmap(z, x = list(corr.columns), y = list(corr.index),
annotation_text = np.around(z, decimals=2),
hoverinfo='z')
fig.show()
fig = px.histogram(df, x="Age")
fig.show()
fig = px.histogram(df, y="Platform", color="Gender", barmode="group", log_x=True)
fig.show()
fig = px.histogram(df, y="Playstyle", color="Gender", barmode="group", log_x=True)
fig.show()
fig = px.violin(df, y="Hours", x="Work",color="Work", hover_data=[df.Hours])
fig.update_layout(showlegend=False)
fig.show()
# With no values column, each pie slice is a row count per category;
# passing values=df.index would weight slices by row index instead
fig = px.pie(df, names='GAD')
fig.show()
fig = px.pie(df, names='SWL')
fig.show()
fig = px.pie(df, names='SPIN')
fig.show()
I started by defining my own label column from cut-off values, but I could not separate the data well and the models consistently underfit. So I tried clustering to look for patterns instead.
I use k-prototypes here because k-means only handles numerical values and k-modes only handles categorical values. The k-prototypes algorithm extends k-modes by combining the k-modes and k-means objectives, so it can cluster mixed numerical and categorical data.
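Under the hood, k-prototypes scores each point against a prototype with a mixed dissimilarity: squared Euclidean distance on the numeric part plus a weight gamma times the number of categorical mismatches. A minimal sketch (the gamma value and the feature split are illustrative, not taken from the fit below):

```python
import numpy as np

def kproto_dissimilarity(x_num, c_num, x_cat, c_cat, gamma=1.0):
    """Mixed distance used by k-prototypes: the numeric part is squared
    Euclidean, the categorical part counts mismatches, weighted by gamma."""
    numeric = np.sum((np.asarray(x_num) - np.asarray(c_num)) ** 2)
    categorical = sum(a != b for a, b in zip(x_cat, c_cat))
    return numeric + gamma * categorical

# Toy point vs. prototype: numeric features (Hours, Age),
# categorical features (Platform, Playstyle)
d = kproto_dissimilarity([15.0, 25], [20.0, 22],
                         ["PC", "Multiplayer"], ["PC", "SinglePlayer"],
                         gamma=0.5)
print(d)  # (15-20)^2 + (25-22)^2 + 0.5 * 1 mismatch = 34.5
```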
# Column indices of the categorical features within df.values
categorical_features_idx = [0, 1, 2, 4, 5, 7, 9, 10, 11, 12, 13, 14, 15, 16]
mark_array = df.values
kproto = KPrototypes(n_clusters=4, verbose=2, max_iter=20).fit(mark_array, categorical=categorical_features_idx)
Initialization method and algorithm are deterministic. Setting n_init to 1.
Init: initializing centroids
Init: initializing clusters
Starting iterations...
Run: 1, iteration: 1/20, moves: 4261, ncost: 1782368.1216841005
Run: 1, iteration: 2/20, moves: 2786, ncost: 1645058.848518113
Run: 1, iteration: 3/20, moves: 1769, ncost: 1564083.6645404622
Run: 1, iteration: 4/20, moves: 899, ncost: 1541589.3414343747
Run: 1, iteration: 5/20, moves: 159, ncost: 1539578.6743116563
Run: 1, iteration: 6/20, moves: 12, ncost: 1539569.4896820632
Run: 1, iteration: 7/20, moves: 1, ncost: 1539569.4317001232
Run: 1, iteration: 8/20, moves: 0, ncost: 1539569.4317001232
...
Run: 3, iteration: 6/20, moves: 0, ncost: 1528401.6038185547
...
Run: 10, iteration: 8/20, moves: 0, ncost: 1550863.64056594
Best run was number 3
print(kproto.cluster_centroids_)
[['18.596765498652292' '27.628032345013477' '21.19191374663073' 'Not difficult at all' 'League of Legends' 'PC' '0' 'improve' 'Male' 'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102' 'minimal' 'slightly dissatisfied' 'minimal']
 ['13.916678805535325' '6.450400582665695' '21.05972323379461' 'Not difficult at all' 'League of Legends' 'PC' '0' 'fun' 'Male' 'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102' 'minimal' 'slightly satisfied' 'minimal']
 ['31.351577591757888' '8.328396651641983' '20.56342562781713' 'Not difficult at all' 'League of Legends' 'PC' '0' 'improve' 'Male' 'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102' 'minimal' 'slightly dissatisfied' 'minimal']
 ['58.56467315716272' '14.584144645340752' '20.289290681502088' 'Not difficult at all' 'League of Legends' 'PC' '0' 'improve' 'Male' 'Student at college / university' 'Other' 'Reddit' 'Multiplayer' '102' 'minimal' 'dissatisfied' 'minimal']]
But clustering did not work either. No matter which feature combinations or cluster counts I tried, I could not split the data so that each cluster mapped onto a concerning cut-off on all three tests.
So I will build my own target variable from the test cut-offs instead.
df["GAD"] = df["GAD"].map(lambda x: 1 if any([y in x for y in ["mild", "moderate", "severe"]]) else 0)
# Match exact labels: a substring test on "dissatisfied" would also catch
# "slightly dissatisfied", which is below the concern threshold
df["SWL"] = df["SWL"].map(lambda x: 1 if x in ["extremely dissatisfied", "dissatisfied"] else 0)
df["SPIN"] = df["SPIN"].map(lambda x: 1 if any([y in x for y in ["moderate", "severe", "extreme"]]) else 0)
My model will try to classify whether a gamer experiences a concerning level of stress or anxiety, using the cut-off values defined by the tests.
df['target'] = np.where((df['GAD']+df['SWL']+df['SPIN'] >= 1), 1, 0)
df = df.drop(columns= ["GAD", "SWL", "SPIN"])
df.head()
| | GADE | Game | Platform | Hours | earnings | whyplay | streams | Gender | Age | Work | Degree | Reference | Playstyle | Residence_ISO3 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Not difficult at all | Skyrim | Console (PS, Xbox, ...) | 15.0 | 0 | fun | 0.0 | Male | 25 | Unemployed / between jobs | Bachelor | Reddit | SinglePlayer | 102 | 0 |
| 1 | Somewhat difficult | Other | PC | 8.0 | 0 | fun | 2.0 | Male | 41 | Unemployed / between jobs | Bachelor | Reddit | Multiplayer | 102 | 1 |
| 2 | Not difficult at all | Other | PC | 0.0 | 0 | fun | 0.0 | Female | 32 | Employed | Bachelor | Reddit | SinglePlayer | 23 | 1 |
| 3 | Not difficult at all | Other | PC | 20.0 | 0 | improve | 5.0 | Male | 28 | Employed | Bachelor | Reddit | Multiplayer | 102 | 0 |
| 4 | Very difficult | Other | Console (PS, Xbox, ...) | 20.0 | 0 | fun | 1.0 | Male | 19 | Employed | Other | Reddit | Multiplayer | 56 | 1 |
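The OR-combination used for `target` (flag a gamer if any of the three concern flags fired) can be sanity-checked on toy flags; the values below are illustrative only:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"GAD": [0, 1, 0, 0],
                    "SWL": [0, 0, 1, 0],
                    "SPIN": [0, 0, 0, 0]})
# target is 1 whenever at least one of the three binary flags is set
toy["target"] = np.where(toy["GAD"] + toy["SWL"] + toy["SPIN"] >= 1, 1, 0)
print(toy["target"].tolist())  # [0, 1, 1, 0]
```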
df = pd.get_dummies(df, drop_first=False)
df.head()
| | Hours | earnings | streams | Age | Residence_ISO3 | target | GADE_Extremely difficult | GADE_Not difficult at all | GADE_Somewhat difficult | GADE_Very difficult | ... | Degree_Master | Degree_Other | Reference_CrowdFlower | Reference_Other | Reference_Reddit | Reference_TeamLiquid.net | Playstyle_Both | Playstyle_Multiplayer | Playstyle_Other | Playstyle_SinglePlayer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 15.0 | 0 | 0.0 | 25 | 102 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 1 | 8.0 | 0 | 2.0 | 41 | 102 | 1 | 0 | 0 | 1 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 0.0 | 0 | 0.0 | 32 | 23 | 1 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 3 | 20.0 | 0 | 5.0 | 28 | 102 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 20.0 | 0 | 1.0 | 19 | 56 | 1 | 0 | 0 | 0 | 1 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 55 columns
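`pd.get_dummies` expands each remaining categorical column into one indicator column per level; with `drop_first=False` every level keeps its own column (dropping one level per feature would avoid perfect collinearity for linear models). A toy illustration:

```python
import pandas as pd

toy = pd.DataFrame({"Playstyle": ["Multiplayer", "SinglePlayer", "Multiplayer"]})
# One indicator column per category level, named <column>_<level>
dummies = pd.get_dummies(toy, drop_first=False)
print(list(dummies.columns))  # ['Playstyle_Multiplayer', 'Playstyle_SinglePlayer']
```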
y = df[['target']]
X = df.drop(['target'], axis=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, test_size=0.10)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, stratify=y_train, test_size=0.10)
X_train.head()
| | Hours | earnings | streams | Age | Residence_ISO3 | GADE_Extremely difficult | GADE_Not difficult at all | GADE_Somewhat difficult | GADE_Very difficult | Game_Counter Strike | ... | Degree_Master | Degree_Other | Reference_CrowdFlower | Reference_Other | Reference_Reddit | Reference_TeamLiquid.net | Playstyle_Both | Playstyle_Multiplayer | Playstyle_Other | Playstyle_SinglePlayer |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13173 | 10.0 | 0 | 4.0 | 18 | 23 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5933 | 24.0 | 0 | 20.0 | 26 | 102 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 7731 | 35.0 | 0 | 2.0 | 20 | 102 | 0 | 1 | 0 | 0 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5118 | 20.0 | 0 | 2.0 | 22 | 80 | 0 | 0 | 0 | 1 | 0 | ... | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 8933 | 50.0 | 0 | 10.0 | 20 | 31 | 0 | 0 | 1 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
5 rows × 54 columns
y_train.head()
| | target |
|---|---|
| 13173 | 0 |
| 5933 | 0 |
| 7731 | 1 |
| 5118 | 1 |
| 8933 | 1 |
ss = StandardScaler()
ss.fit(X_train)
X_train = ss.transform(X_train)
X_val = ss.transform(X_val)
X_test = ss.transform(X_test)
def performance_metrics(y_val, pred_val):
print('Recall: ', metrics.recall_score(y_val, pred_val))
tn, fp, fn, tp = metrics.confusion_matrix(y_val, pred_val).ravel()
print('Specificity: ', (tn / (tn + fp)))
print('Precision: ', metrics.precision_score(y_val, pred_val))
print('F1-score: ', metrics.f1_score(y_val, pred_val))
print('Balanced Accuracy: ', metrics.balanced_accuracy_score(y_val, pred_val))
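The quantities this helper reports can be checked by hand from raw confusion-matrix counts; the counts below are assumed toy values, not results from this dataset:

```python
# Hand-computed versions of the metrics printed by performance_metrics,
# from toy confusion-matrix counts
tn, fp, fn, tp = 50, 10, 20, 40

recall = tp / (tp + fn)              # sensitivity / true-positive rate
specificity = tn / (tn + fp)         # true-negative rate
precision = tp / (tp + fp)
f1 = 2 * precision * recall / (precision + recall)
balanced_accuracy = (recall + specificity) / 2

print(recall, specificity, precision, f1, balanced_accuracy)
```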
model = RandomForestClassifier()
model.fit(X_train, y_train)
RandomForestClassifier()
pred_train = model.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.9925020827547902
pred_val = model.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7278797996661102
min_samples_splits = [10, 50, 100, 200]
max_depths = [2,5,10,15]
n_estimators = [100, 500]
params = {
"min_samples_split": min_samples_splits,
"max_depth": max_depths,
"n_estimators": n_estimators
}
grid_search = GridSearchCV(estimator=model, param_grid=params, scoring="f1", n_jobs=-1)
grid_search.fit(X_train, y_train)
GridSearchCV(estimator=RandomForestClassifier(), n_jobs=-1,
param_grid={'max_depth': [2, 5, 10, 15],
'min_samples_split': [10, 50, 100, 200],
'n_estimators': [100, 500]},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Best cross-validation F1 :", grid_search.best_score_)
Best hyperparameter values:  {'max_depth': 10, 'min_samples_split': 200, 'n_estimators': 500}
Best cross-validation F1 : 0.7500868186356877
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7549342105263157
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e67796d0>
performance_metrics(y_val, pred_val)
Recall:  0.7637271214642263
Specificity:  0.7045454545454546
Precision:  0.7463414634146341
F1-score:  0.7549342105263157
Balanced Accuracy:  0.7341362880048404
estimator = LogisticRegression()
estimator.fit(X_train, y_train)
LogisticRegression()
pred_train = estimator.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7499764750164675
pred_val = estimator.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7535269709543568
parameters = {
'solver': ['newton-cg', 'lbfgs', 'liblinear'],
'penalty': ['l2'],
'C': [100, 10, 1.0, 0.1, 0.01]
}
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
grid_search.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=LogisticRegression(), n_jobs=-1,
param_grid={'C': [100, 10, 1.0, 0.1, 0.01], 'penalty': ['l2'],
'solver': ['newton-cg', 'lbfgs', 'liblinear']},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Best cross-validation F1 :", grid_search.best_score_)
Best hyperparameter values:  {'C': 0.01, 'penalty': 'l2', 'solver': 'newton-cg'}
Best cross-validation F1 : 0.7489636142737558
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7545605306799336
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e611f1c0>
performance_metrics(y_val, pred_val)
Recall:  0.757071547420965
Specificity:  0.7159090909090909
Precision:  0.7520661157024794
F1-score:  0.7545605306799336
Balanced Accuracy:  0.736490319165028
estimator = KNeighborsClassifier()
estimator.fit(X_train, y_train)
KNeighborsClassifier()
pred_train = estimator.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7852215817821033
pred_val = estimator.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.699581589958159
parameters = {
'n_neighbors': range(1, 21, 2),
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan', 'minkowski']
}
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
grid_search.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=KNeighborsClassifier(), n_jobs=-1,
param_grid={'metric': ['euclidean', 'manhattan', 'minkowski'],
'n_neighbors': range(1, 21, 2),
'weights': ['uniform', 'distance']},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Best cross-validation F1 :", grid_search.best_score_)
Best hyperparameter values:  {'metric': 'manhattan', 'n_neighbors': 19, 'weights': 'uniform'}
Best cross-validation F1 : 0.7197998110096405
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7362270450751252
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e4da98e0>
performance_metrics(y_val, pred_val)
Recall: 0.7337770382695508 Specificity: 0.7045454545454546 Precision: 0.7386934673366834 F1-score: 0.7362270450751252 Balanced Accuracy: 0.7191612464075027
model = AdaBoostClassifier()
model.fit(X_train, y_train)
AdaBoostClassifier()
pred_train = model.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7503762227238524
pred_val = model.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7514546965918537
parameters = {
'n_estimators': range(10, 200, 5),
'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10, 100]
}
grid_search = GridSearchCV(model, parameters, cv=10, n_jobs=-1, scoring='f1')
grid_search.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=AdaBoostClassifier(), n_jobs=-1,
param_grid={'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10,
100],
'n_estimators': range(10, 200, 5)},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train score :", grid_search.best_score_)
Best hyperparameter values: {'learning_rate': 1, 'n_estimators': 45}
Train score : 0.7491643493073873
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7518672199170123
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e621dbb0>
performance_metrics(y_val, pred_val)
Recall: 0.7537437603993344 Specificity: 0.7140151515151515 Precision: 0.75 F1-score: 0.7518672199170123 Balanced Accuracy: 0.733879455957243
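A cheap way to sanity-check the tuned `n_estimators=45` is AdaBoost's `staged_predict`, which replays the ensemble one boosting round at a time without refitting. A sketch on synthetic stand-in data (not the survey features):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import f1_score

# Synthetic stand-in for the gamer-anxiety feature matrix
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

model = AdaBoostClassifier(n_estimators=50, random_state=0)
model.fit(X, y)

# F1 after each boosting round -- shows where adding estimators stops helping
staged_f1 = [f1_score(y, pred) for pred in model.staged_predict(X)]
```

Plotting `staged_f1` makes it easy to see whether the score plateaus near the grid-search optimum.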
estimator = SVC(kernel='rbf')
estimator.fit(X_train, y_train)
SVC()
pred_train = estimator.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7579125847776941
pred_val = estimator.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7529215358931554
parameters = {
'C': [1, 10, 100, 1000],
'gamma': [0.001, 0.01, 0.1, 1]
}
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
grid_search.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=SVC(), n_jobs=-1,
param_grid={'C': [1, 10, 100, 1000],
'gamma': [0.001, 0.01, 0.1, 1]},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Training score :", grid_search.best_score_)
Best hyperparameter values: {'C': 10, 'gamma': 0.001}
Training score : 0.7473337588537545
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7572977481234363
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e81cbdc0>
performance_metrics(y_val, pred_val)
Recall: 0.7554076539101497 Specificity: 0.7272727272727273 Precision: 0.7591973244147158 F1-score: 0.7572977481234363 Balanced Accuracy: 0.7413401905914385
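Since the winning values (C=10, gamma=0.001) fall strictly inside the coarse grid, a finer follow-up grid around them could squeeze out a little more; a sketch of the idea on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the standardized training data
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Zoom in around the coarse-grid winner (C=10, gamma=0.001)
fine_grid = {
    'C': [3, 10, 30],
    'gamma': [0.0003, 0.001, 0.003],
}
search = GridSearchCV(SVC(kernel='rbf'), fine_grid, cv=5, n_jobs=-1, scoring='f1')
search.fit(X, y)
print(search.best_params_)
```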
model = GradientBoostingClassifier()
model.fit(X_train, y_train)
GradientBoostingClassifier()
pred_train = model.predict(X_train)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7573742859818335
pred_val = model.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7524752475247526
parameters = {
'n_estimators': range(1, 15),
'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10, 100]
}
grid_search = GridSearchCV(model, parameters, cv=10, n_jobs=-1, scoring='f1')
grid_search.fit(X_train,y_train)
GridSearchCV(cv=10, estimator=GradientBoostingClassifier(), n_jobs=-1,
param_grid={'learning_rate': [0.01, 0.05, 0.1, 0.25, 0.5, 1, 10,
100],
'n_estimators': range(1, 15)},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train score :", grid_search.best_score_)
Best hyperparameter values: {'learning_rate': 0.25, 'n_estimators': 14}
Train score : 0.7490153737547018
pred_val = grid_search.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.757071547420965
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1ee3fa910>
performance_metrics(y_val, pred_val)
Recall: 0.757071547420965 Specificity: 0.7234848484848485 Precision: 0.757071547420965 F1-score: 0.757071547420965 Balanced Accuracy: 0.7402781979529067
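Collecting the tuned validation F1-scores reported above into one table makes the comparison explicit (values copied, rounded, from the outputs above):

```python
import pandas as pd

# Validation F1-scores of the tuned models, as reported above
val_scores = {
    "Logistic Regression": 0.7546,
    "KNN": 0.7362,
    "AdaBoost": 0.7519,
    "SVM (RBF)": 0.7573,
    "Gradient Boosting": 0.7571,
}
summary = pd.Series(val_scores, name="Validation F1").sort_values(ascending=False)
print(summary)
```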
pca = PCA(n_components=None)
pca.fit(X_train)
X_train_pca = pca.transform(X_train)
X_val_pca = pca.transform(X_val)
X_test_pca = pca.transform(X_test)
plt.rcParams['figure.figsize'] = [15, 5]
print(pca.explained_variance_ratio_.cumsum())
plt.plot(pca.explained_variance_ratio_.cumsum(), '-o');
plt.xticks(ticks= range(X_train_pca.shape[1]), labels=[i+1 for i in range(X_train_pca.shape[1])])
plt.xlabel('Principal Components')
plt.ylabel('Variance Explained')
plt.show()
[0.05667402 0.10590517 0.14508473 0.18283884 0.21769369 0.25088518 0.28201313 0.31096007 0.33924632 0.36490784 0.38797049 0.41053378 0.43226598 0.45377246 0.47472923 0.49514037 0.51533088 0.53523005 0.55478265 0.57415596 0.59334971 0.61244475 0.63138635 0.65028774 0.66909677 0.68772755 0.70630321 0.72470787 0.7430054 0.76125049 0.77933777 0.79724634 0.81511447 0.83266108 0.85001506 0.86700541 0.88381721 0.90045208 0.91693923 0.9333307 0.94915756 0.96445279 0.9790912 0.99190021 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. ]
X_train_pca2 = X_train_pca[:, 0:39]
X_val_pca2 = X_val_pca[:, 0:39]
X_test_pca2 = X_test_pca[:, 0:39]
Using any of these models would be fine here, as all of them give roughly the same results, but I'm using the SVM classifier because it gives a marginally higher F1-score.
estimator = SVC(kernel='rbf')
estimator.fit(X_train_pca2,y_train)
SVC()
pred_train = estimator.predict(X_train_pca2)
print("Train score:", metrics.f1_score(y_train, pred_train))
Train score: 0.7572047466566207
pred_val = estimator.predict(X_val_pca2)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7529215358931554
parameters = {
'C': [1, 10, 100, 1000],
'gamma': [0.001, 0.01, 0.1, 1]
}
grid_search = GridSearchCV(estimator, parameters, cv=10, n_jobs=-1, scoring='f1')
grid_search.fit(X_train_pca2,y_train)
GridSearchCV(cv=10, estimator=SVC(), n_jobs=-1,
param_grid={'C': [1, 10, 100, 1000],
'gamma': [0.001, 0.01, 0.1, 1]},
scoring='f1')
print("Best hyperparameter values: ", grid_search.best_params_)
print("Train score :", grid_search.best_score_)
Best hyperparameter values: {'C': 10, 'gamma': 0.001}
Train score : 0.7472882769622254
pred_val = grid_search.predict(X_val_pca2)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7572977481234363
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_val, pred_val)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1e62178e0>
performance_metrics(y_val, pred_val)
Recall: 0.7554076539101497 Specificity: 0.7272727272727273 Precision: 0.7591973244147158 F1-score: 0.7572977481234363 Balanced Accuracy: 0.7413401905914385
Next, I will use the SVM classifier with its best hyperparameters to check its test score. Using either the PCA-transformed or the standardized data is fine here; I will stick with the standardized data.
svm = SVC(C=10, gamma=0.001, kernel='rbf')
svm.fit(X_train, y_train)
SVC(C=10, gamma=0.001)
pred_val = svm.predict(X_val)
print("Validation score:", metrics.f1_score(y_val, pred_val))
Validation score: 0.7572977481234363
pred_test = svm.predict(X_test)
print("Test score:", metrics.f1_score(y_test, pred_test))
Test score: 0.7378787878787879
metrics.ConfusionMatrixDisplay(confusion_matrix=metrics.confusion_matrix(y_test, pred_test)).plot()
<sklearn.metrics._plot.confusion_matrix.ConfusionMatrixDisplay at 0x2b1ee3a6df0>
performance_metrics(y_test, pred_test)
Recall: 0.7279521674140508 Specificity: 0.7201365187713311 Precision: 0.7480798771121352 F1-score: 0.7378787878787879 Balanced Accuracy: 0.724044343092691
We can see that the model is not overfitting; the main reason behind the modest accuracy is the imbalanced dataset. More intense hyperparameter tuning would likely only lead to overfitting. We could try cleaning the data again with different approaches, but in any case more data is required for further analysis.
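One low-cost idea for the imbalance, not part of the original pipeline and offered as an assumption worth testing, is SVC's `class_weight='balanced'`, which scales the penalty for each class inversely to its frequency. A sketch on synthetic imbalanced data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import f1_score

# Synthetic imbalanced data as a stand-in for the survey features
X, y = make_classification(n_samples=1000, n_features=10,
                           weights=[0.7, 0.3], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

# 'balanced' weights each class by n_samples / (n_classes * class_count),
# so minority-class errors cost more
svm = SVC(kernel='rbf', class_weight='balanced')
svm.fit(X_tr, y_tr)
print("F1:", f1_score(y_te, svm.predict(X_te)))
```

Resampling approaches (e.g. stratified over/undersampling) would be another direction to compare against this reweighting.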
I tried different definitions for the target variable, but the current combination gave me the best F1-score. It checks whether any of the tests (GAD, SWL, SPIN) has a concerning value. One combination even reached 82% accuracy, but its specificity was very low and the model couldn't be trusted, which is why, after many iterations, I settled on this combination.
From this project I learned how important preprocessing is. Even though I spent days coming up with the current solution, I still believe I haven't spent enough time with the data. If I had more time, I would definitely have tried text clustering as well. In any case, this project has improved my analysis skills, and I'm glad I chose such a complex dataset.